Bayesian Methods for Frequent Terms in Text: Models of Contagion and the ∆ Statistic

Authors

  • Edoardo M. Airoldi
  • William W. Cohen
  • Stephen E. Fienberg
Abstract

Most statistical approaches to modeling text implicitly assume that informative words are rare. This assumption is usually appropriate for topical retrieval and classification tasks; however, in non-topical classification and soft-clustering problems where classes and latent variables relate to sentiment or author, informative words can be frequent. In this paper we present a comprehensive set of statistical learning tools which treat words with higher frequencies of occurrence in a sensible manner. We introduce probabilistic models of contagion for classification and soft-clustering based on the Poisson and Negative-Binomial distributions, which share with the Multinomial the desirable properties of simplicity and analytic tractability. We then introduce the ∆ statistic to select features and avoid over-fitting.
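As a rough illustration of why the choice of count distribution matters for frequent words, the sketch below compares a Poisson fit with a Negative-Binomial fit on per-document counts of a single word. It is not the authors' model or their ∆ statistic: the counts are made up, the mean/dispersion parameterization and the method-of-moments estimate of `r` are assumptions chosen for brevity, and the function names are hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of count k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def negbin_logpmf(k, mu, r):
    """Log-probability of count k under a Negative-Binomial with
    mean mu and dispersion r (variance = mu + mu**2 / r)."""
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )


# Made-up counts of one frequent word in ten documents (illustrative only).
# The word is absent from most documents but bursts in a few of them.
counts = [0, 0, 1, 0, 7, 0, 9, 1, 0, 2]

n = len(counts)
mean = sum(counts) / n
var = sum((c - mean) ** 2 for c in counts) / n

# Method-of-moments dispersion estimate; only valid when var > mean.
r = mean ** 2 / (var - mean)

pois_ll = sum(poisson_logpmf(c, mean) for c in counts)
nb_ll = sum(negbin_logpmf(c, mean, r) for c in counts)

print(f"mean={mean:.2f} var={var:.2f} r={r:.3f}")
print(f"Poisson log-likelihood:           {pois_ll:.2f}")
print(f"Negative-Binomial log-likelihood: {nb_ll:.2f}")
```

On bursty counts like these, where the variance is well above the mean, the Negative-Binomial attains a higher log-likelihood than the Poisson because its extra dispersion parameter absorbs the clustering of occurrences.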

Similar resources

Bayesian Melding of Deterministic Models and Kriging for Analysis of Spatially Dependent Data

Spatial data melding methods owe their invention and development to the link between geographic information systems and decision-making approaches. These methods combine different data sets to achieve better results. In this paper, the Bayesian melding method for combining measurements with the outputs of deterministic models and kriging is considered. Then the ozone data in Tehran city are analyze...

Bayesian Two-Sample Prediction with Progressively Type-II Censored Data for Some Lifetime Models

Prediction on the basis of censored data is a very important topic in many fields, including medical and engineering sciences. In this paper, based on a progressive Type-II right censoring scheme, we discuss Bayesian two-sample prediction. A general form for the lifetime model, including some well-known and useful models such as Weibull and Pareto, is considered for obtaining prediction bounds ...

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents to predefined categories based on the documents’ contents and labeled training samples. Since word detection is a difficult and time-consuming task in the Persian language, a Bayesian text classifier is an appropriate approach to deal with different...

Author gender identification from text using Bayesian Random Forest

Nowadays, the heavy use of virtual environments and the connections users form via social networks such as Facebook, Instagram, and Twitter make it more necessary than ever to identify shared subjects in these environments. Several applications benefit from reliable methods for inferring the age and gender of social media users. Such applications exist across a wide area of fields,...

Publication date: 2005